#PureEmotion
- 0.1 Preliminaries
- 1 Task 1: Data Source
- 2 Task 2: Load and Tidy the Data
- 3 Task 3: Listing of My Tibble
- 4 Task 4: Code Book
- 5 Task 5: My Subjects
- 6 Task 6: My Variables
- 7 Task 7: My Planned Linear Regression Model
- 8 Task 8: My Planned Logistic Regression Model
- 9 Task 9: Affirmation
- 10 Tweeting Habits
- 11 Sentiment Analysis
0.1 Preliminaries
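The chunks below call functions from several packages, which are presumably loaded here in the Preliminaries. A sketch of that setup chunk, inferred from the functions used throughout this document (the package list is my reconstruction, not shown in the original):

```r
library(lubridate)    # ymd_hms(), with_tz(), wday(), month(), year()
library(dplyr)        # sample_n(), group_by(), summarise(), %>%
library(stringr)      # str_replace_all(), str_extract_all()
library(tm)           # Corpus(), tm_map(), stopwords()
library(syuzhet)      # get_nrc_sentiment()
library(reshape2)     # melt()
library(ggplot2)      # all plots
library(scales)       # date_breaks(), date_format()
library(wordcloud)    # wordcloud()
library(RColorBrewer) # brewer.pal()
```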
1 Task 1: Data Source
The source of my data is a .csv file obtained directly from the settings page of my personal Twitter account (@McDonnellJack). I am heavily indebted to Julia Silge (@juliasilge) for sharing the basic code necessary for a project like this.
2 Task 2: Load and Tidy the Data
tweets <- read.csv("./tweets.csv", stringsAsFactors = FALSE)
# use lubridate to convert time stamps to date/time objects
tweets$timestamp <- ymd_hms(tweets$timestamp)
tweets$timestamp <- with_tz(tweets$timestamp, "America/New_York")
# create a time-of-day variable (dropping the date portion)
tweets$timeonly <- as.numeric(tweets$timestamp - trunc(tweets$timestamp, "days"))
#creating a variable for type of tweet (tweet, retweet, reply)
tweets$type <- "tweet"
tweets$type[!is.na(tweets$retweeted_status_id)] <- "RT"
tweets$type[!is.na(tweets$in_reply_to_status_id)] <- "reply"
tweets$type <- factor(tweets$type, levels = c("tweet", "RT", "reply"))
#creating a variable for number of characters in each tweet
tweets$charsintweet <- nchar(tweets$text)
#remove handles from tweets in preparation for sentiment analysis
nohandles <- str_replace_all(tweets$text, "@\\w+", "")
# exclude useless variables
drop_vars <- names(tweets) %in% c("expanded_urls", "timeonly", "source", "V12")
tweets <- tweets[!drop_vars]
# get sample of 1000
set.seed(02202018)
tweets_sample <- sample_n(tweets, size = 1000)
write.csv(tweets_sample, "tweets_sample.csv")

Ultimately, the tidying I do here sets the stage for the graphical analysis and modeling to come. The .csv that Twitter provides for each user (from the "settings" section of an individual's Twitter page) is a bit clunky and hard to use, especially with respect to the timestamp. The first part of the above chunk therefore generates date/time variables that are easier to work with. The next section prepares the text of my tweets for sentiment analysis. The final section obtains a random sample of 1,000 tweets, as required for this assignment.
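The timestamp conversion above can be illustrated on a single literal value (a sketch using base R's as.POSIXct in place of lubridate's ymd_hms; the raw timestamp is a made-up example in Twitter's export format):

```r
# Parse a UTC timestamp and display it in Eastern time
raw <- "2015-01-16 20:18:16 +0000"
ts <- as.POSIXct(raw, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(ts, tz = "America/New_York", usetz = TRUE)
# "2015-01-16 15:18:16 EST"
```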
3 Task 3: Listing of My Tibble
glimpse(tweets_sample)

Observations: 1,000
Variables: 10
$ tweet_id <dbl> 5.561837e+17, 4.942579e+17, 9.17178...
$ in_reply_to_status_id <dbl> NA, NA, NA, NA, NA, 5.897073e+17, N...
$ in_reply_to_user_id <dbl> NA, NA, NA, NA, NA, 1.103924e+09, N...
$ timestamp <dttm> 2015-01-16 15:18:16, 2014-07-29 19...
$ text <chr> "At the peds conference. All the ex...
$ retweeted_status_id <dbl> NA, 4.942566e+17, 9.171154e+17, NA,...
$ retweeted_status_user_id <dbl> NA, 8.522889e+07, 2.950125e+07, NA,...
$ retweeted_status_timestamp <chr> "", "2014-07-29 23:02:10 +0000", "2...
$ type <fct> tweet, RT, RT, tweet, RT, reply, tw...
$ charsintweet <int> 134, 140, 140, 106, 140, 69, 32, 14...
My tidy tibble, called tweets_sample, contains 1,000 observations (a random sample of the original 1,665 tweets) and 10 variables. These variables form the basis of my simple/tidy dataset, but from them, using the syuzhet package in R, I will later derive some additional variables necessary for my analysis.
4 Task 4: Code Book
| Variable | Type | Details |
|---|---|---|
| tweet_id | character | identification code of the tweet |
| in_reply_to_status_id | character | identification code for tweets to which I replied |
| in_reply_to_user_id | character | identification code for users to whom I replied |
| timestamp | year-month-day-time | time stamp of the tweet |
| text | character | actual text of the tweet |
| retweeted_status_id | character | identification code for tweets I retweeted |
| retweeted_status_user_id | character | identification code for users I retweeted |
| retweeted_status_timestamp | character | time stamp for tweets I retweeted |
| type | factor | type of tweet (tweet, retweet, reply) |
| charsintweet | numeric | number of characters in each tweet |
5 Task 5: My Subjects
The rows of my data set represent individual tweets I have sent out from my personal Twitter account, @McDonnellJack.
6 Task 6: My Variables
- tweet_id is a character variable, assigned by Twitter; each tweet is given a distinctive identification number.
- in_reply_to_status_id is another character variable assigned by Twitter. If I sent a tweet in reply to another tweet, Twitter captures the identification number of that other tweet here.
- in_reply_to_user_id is another character variable assigned by Twitter. For tweets I replied to, this variable captures the user identification number of the author of the original tweet.
- timestamp is a time variable corresponding to when my tweet went out, in Eastern Standard Time. It follows the format of, for example, "2018-02-23 13:00:00".
- text is a character variable containing the actual text of the sample of 1,000 tweets I will be analyzing.
- retweeted_status_id is another character variable assigned by Twitter. If I retweeted another user's tweet, Twitter captures the identification number of that other tweet here.
- retweeted_status_user_id is another character variable assigned by Twitter. If I retweeted another user's tweet, Twitter captures the identification number of that other user here.
- retweeted_status_timestamp is another character variable assigned by Twitter. If I retweeted another user's tweet, Twitter captures the timestamp of that original tweet here, in the same date/time format referenced in the description of timestamp above.
- type is a factor variable with three levels: "tweet", "reply", and "RT" (the last indicating a retweet). 23.3% of my tweets were regular tweets, 24.2% were replies to another user, and 52.5% were retweets.
- charsintweet is a numeric/integer variable counting the number of characters in each of my tweets. It ranges from 9 to 314.
Note that my data has no true missing values, except for those missing by design. For example, variables like retweeted_status_timestamp will have missing values for tweets that were not retweets.
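The type percentages quoted above fall straight out of prop.table() on the type factor. A sketch on a small illustrative factor (the six values here are made up, standing in for tweets_sample$type):

```r
# Percentage breakdown of tweet types
type <- factor(c("tweet", "RT", "RT", "reply", "RT", "tweet"),
               levels = c("tweet", "RT", "reply"))
round(100 * prop.table(table(type)), 1)
# tweet: 33.3, RT: 50.0, reply: 16.7
```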
7 Task 7: My Planned Linear Regression Model
To accomplish my linear model, I will need to put in a bit of preparatory code.
#convert text of my tweets into a form that can be analyzed by 'tm' package
wordCorpus <- Corpus(VectorSource(nohandles))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
wordCorpus <- tm_map(wordCorpus, removeWords, stopwords("english"))
wordCorpus <- tm_map(wordCorpus, removeWords, c("amp", "2yo", "3yo", "4yo"))
wordCorpus <- tm_map(wordCorpus, stripWhitespace)
# run NRC sentiment analysis (via the syuzhet package) on each tweet's text
mySentiment <- get_nrc_sentiment(tweets$text)
#combine sentiment analysis with tweets
tweets <- cbind(tweets, mySentiment)
#compute a total sentiment score
sentimentTotals <- data.frame(colSums(tweets[,c(10:19)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
# time stamp tweets to be able to analyze the sentiment as a function of time
tweets$timestamp <- with_tz(ymd_hms(tweets$timestamp), "America/New_York")
posnegtime <- tweets %>%
group_by(timestamp = cut(timestamp, breaks="2 months")) %>%
summarise(negative = mean(negative),
positive = mean(positive)) %>% melt
names(posnegtime) <- c("timestamp", "sentiment", "meanvalue")
posnegtime$sentiment = factor(posnegtime$sentiment,levels(posnegtime$sentiment)[c(2,1)])
# create new variable for considering tweet sentiment by weekday
tweets$weekday <- wday(tweets$timestamp, label = TRUE)
weeklysentiment <- tweets %>% group_by(weekday) %>%
summarise(anger = mean(anger),
anticipation = mean(anticipation),
disgust = mean(disgust),
fear = mean(fear),
joy = mean(joy),
sadness = mean(sadness),
surprise = mean(surprise),
trust = mean(trust)) %>% melt
names(weeklysentiment) <- c("weekday", "sentiment", "meanvalue")
# create new variable for considering tweet sentiment by month
tweets$month <- month(tweets$timestamp, label = TRUE)
monthlysentiment <- tweets %>% group_by(month) %>%
summarise(anger = mean(anger),
anticipation = mean(anticipation),
disgust = mean(disgust),
fear = mean(fear),
joy = mean(joy),
sadness = mean(sadness),
surprise = mean(surprise),
trust = mean(trust)) %>% melt
names(monthlysentiment) <- c("month", "sentiment", "meanvalue")

For my linear regression model, I would like to look at how the length of a tweet is affected by its sentiment, its type, and date/time factors. Specifically, I will run a linear model with the quantitative outcome variable charsintweet and predictors positive (a measure of the overall positivity of a tweet, produced by the syuzhet package as above), weekday (derived from timestamp), month (derived from timestamp), and type.
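The planned model itself is a one-liner once the derived columns exist. A sketch on a tiny made-up data frame (toy stands in for tweets; the real model would also include weekday and month as predictors):

```r
# Linear model sketch: tweet length predicted by positivity and tweet type
toy <- data.frame(
  charsintweet = c(140, 90, 60, 140, 30, 120),
  positive     = c(2, 1, 0, 3, 0, 1),
  type         = factor(c("tweet", "RT", "reply", "tweet", "reply", "RT"))
)
fit <- lm(charsintweet ~ positive + type, data = toy)
coef(fit)  # intercept plus slopes for positive and the type contrasts
```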
8 Task 8: My Planned Logistic Regression Model
hashtag <- factor(grepl("#", tweets$text))

For my logistic regression model, I will use hashtag (corresponding to the presence or absence of hashtags in a tweet) as my outcome variable. Predictors in this case will be type, month, weekday, and year. Maybe I'll discover something interesting about my own tweeting habits!
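The planned logistic model can be sketched the same way with glm() (toy is a made-up stand-in; the real fit would also include month, weekday, and a year column derived with, e.g., year(tweets$timestamp)):

```r
# Logistic model sketch: presence of a hashtag predicted by tweet type
toy <- data.frame(
  hashtag = factor(c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)),
  type    = factor(c("tweet", "RT", "reply", "tweet", "RT", "reply"))
)
fit <- glm(hashtag ~ type, data = toy, family = binomial)
coef(fit)
```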
9 Task 9: Affirmation
I am certain that it is completely appropriate for these data to be shared with anyone, without any conditions. There are no concerns about privacy or security.
10 Tweeting Habits
10.1 Tweets Over Time
Here’s a histogram of my tweets over time.
ggplot(data = tweets, aes(x = timestamp)) +
geom_histogram(aes(fill = ..count..)) +
theme(legend.position = "none") +
xlab("Time") + ylab("Number of tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

My tweeting has really taken off since 2017.
10.2 Tweets by Year
ggplot(data = tweets, aes(x = year(timestamp))) +
geom_histogram(breaks = seq(2010, 2018.5, by =1), aes(fill = ..count..)) +
theme(legend.position = "none") +
xlab("Time") + ylab("Number of tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

10.3 Tweets by Day
ggplot(data = tweets, aes(x = wday(timestamp, label = TRUE))) +
geom_bar(aes(fill = ..count..)) +
theme(legend.position = "none") +
xlab("Day of the Week") + ylab("Number of tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

Lots of "hump-day" tweeting, for some reason . . .
10.4 Tweets by Month
ggplot(data = tweets, aes(x = month(timestamp, label = TRUE))) +
geom_bar(aes(fill = ..count..)) +
theme(legend.position = "none") +
xlab("Month") + ylab("Number of tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

10.5 Tweets by Time of Day
tweets$timeonly <- as.numeric(tweets$timestamp - trunc(tweets$timestamp, "days"))
class(tweets$timeonly) <- "POSIXct"
ggplot(data = tweets, aes(x = timeonly)) +
geom_histogram(aes(fill = ..count..)) +
theme(legend.position = "none") +
xlab("Time") + ylab("Number of tweets") +
scale_x_datetime(breaks = date_breaks("3 hours"),
labels = date_format("%H:00")) +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

I can't figure out why I can't get the x axis properly labelled, but will work on it . . .
10.6 Late Night Tweets
latenighttweets <- tweets[(hour(tweets$timestamp) < 6),]
ggplot(data = latenighttweets, aes(x = timestamp)) +
geom_histogram(aes(fill = ..count..)) +
theme(legend.position = "none") +
xlab("Time") + ylab("Number of tweets") + ggtitle("Late Night Tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

10.7 Hashtagging
ggplot(tweets, aes(factor(grepl("#", tweets$text)))) +
geom_bar(fill = "midnightblue") +
theme(legend.position="none", axis.title.x = element_blank()) +
ylab("Number of tweets") +
ggtitle("Tweets with Hashtags") +
scale_x_discrete(labels=c("No hashtags", "Tweets with hashtags"))

Turns out I'm not a big fan of the hashtag.
10.8 Retweets
ggplot(tweets, aes(factor(!is.na(retweeted_status_id)))) +
geom_bar(fill = "midnightblue") +
theme(legend.position="none", axis.title.x = element_blank()) +
ylab("Number of tweets") +
ggtitle("Retweeted Tweets") +
scale_x_discrete(labels=c("Not retweeted", "Retweeted tweets"))

I LOVE to retweet.
10.9 Replied Tweets
ggplot(tweets, aes(factor(!is.na(in_reply_to_status_id)))) +
geom_bar(fill = "midnightblue") +
theme(legend.position="none", axis.title.x = element_blank()) +
ylab("Number of tweets") +
ggtitle("Replied Tweets") +
scale_x_discrete(labels=c("Not in reply", "Replied tweets"))

I was surprised at how little I replied. I thought it would be a lot more!
10.10 Tweeting Style: Changes Over Time
ggplot(data = tweets, aes(x = timestamp, fill = type)) +
geom_histogram() +
xlab("Time") + ylab("Number of tweets") +
scale_fill_manual(values = c("midnightblue", "deepskyblue4", "aquamarine3"))

Although I've always done it, since the election of @RealDonaldTrump I've started retweeting more.
10.11 Character Counts
ggplot(data = tweets, aes(x = charsintweet)) +
geom_histogram(aes(fill = ..count..), binwidth = 8) +
theme(legend.position = "none") +
xlab("Characters per Tweet") + ylab("Number of tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4")

I typically tweeted the maximum 140 characters each time, and never really adjusted to Twitter raising the limit.
11 Sentiment Analysis
11.1 Word Cloud
Now, it might be interesting to see what kind of word cloud my tweets generate.
pal <- brewer.pal(9,"YlGnBu")
pal <- pal[-(1:4)]
set.seed(123)
wordcloud(words = wordCorpus, scale=c(5,0.1), max.words=100, random.order=FALSE,
rot.per=0.35, use.r.layout=FALSE, colors=pal)

Ugh. I could have predicted this.
11.2 Friends
friends <- str_extract_all(tweets$text, "@\\w+")
namesCorpus <- Corpus(VectorSource(friends))
set.seed(146)
wordcloud(words = namesCorpus, scale=c(3,0.5), max.words=40, random.order=FALSE,
rot.per=0.10, use.r.layout=FALSE, colors=pal)

I'll let you know when I figure out who "character" is. The rest of the people I tweet at the most make some sense to me, personally.
sentimentTotals <- data.frame(colSums(tweets[,c(11:18)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
geom_bar(aes(fill = sentiment), stat = "identity") +
theme(legend.position = "none") +
xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Score for All Tweets")

I'm surprised I am so trusting.
11.3 Sentiment Over Time
tweets$timestamp <- with_tz(ymd_hms(tweets$timestamp), "America/New_York")
posnegtime <- tweets %>%
group_by(timestamp = cut(timestamp, breaks="2 months")) %>%
summarise(negative = mean(negative),
positive = mean(positive)) %>% melt
names(posnegtime) <- c("timestamp", "sentiment", "meanvalue")
posnegtime$sentiment = factor(posnegtime$sentiment,levels(posnegtime$sentiment)[c(2,1)])
ggplot(data = posnegtime, aes(x = as.Date(timestamp), y = meanvalue, group = sentiment)) +
geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
geom_point(size = 0.5) +
ylim(0, NA) +
scale_colour_manual(values = c("springgreen4", "firebrick3")) +
theme(legend.title=element_blank(), axis.title.x = element_blank()) +
scale_x_date(breaks = date_breaks("9 months"),
labels = date_format("%Y-%b")) +
ylab("Average sentiment score") +
ggtitle("Sentiment Over Time")

If you buy me a beer, I'll explain what was going on in my life that made me so negative in May 2015.
11.4 Sentiment by Day of Week
tweets$weekday <- wday(tweets$timestamp, label = TRUE)
weeklysentiment <- tweets %>% group_by(weekday) %>%
summarise(anger = mean(anger),
anticipation = mean(anticipation),
disgust = mean(disgust),
fear = mean(fear),
joy = mean(joy),
sadness = mean(sadness),
surprise = mean(surprise),
trust = mean(trust)) %>% melt
names(weeklysentiment) <- c("weekday", "sentiment", "meanvalue")
ggplot(data = weeklysentiment, aes(x = weekday, y = meanvalue, group = sentiment)) +
geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
geom_point(size = 0.5) +
ylim(0, 0.6) +
theme(legend.title=element_blank(), axis.title.x = element_blank()) +
ylab("Average sentiment score") +
ggtitle("Sentiment During the Week")

Mondays ARE pretty disgusting . . .
11.5 Sentiment by Month
tweets$month <- month(tweets$timestamp, label = TRUE)
monthlysentiment <- tweets %>% group_by(month) %>%
summarise(anger = mean(anger),
anticipation = mean(anticipation),
disgust = mean(disgust),
fear = mean(fear),
joy = mean(joy),
sadness = mean(sadness),
surprise = mean(surprise),
trust = mean(trust)) %>% melt
names(monthlysentiment) <- c("month", "sentiment", "meanvalue")
ggplot(data = monthlysentiment, aes(x = month, y = meanvalue, group = sentiment)) +
geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
geom_point(size = 0.5) +
ylim(0, NA) +
theme(legend.title=element_blank(), axis.title.x = element_blank()) +
ylab("Average sentiment score") +
ggtitle("Sentiment During the Year")

For a military officer like me, "PCS" season (when you move to a new base) starts in May. That's why I think I was so fearful :(